Transient Sonar Signal Classification Using
Hidden  Markov  Models  and  Neural  Nets 

Amlan Kundu, Member, IEEE, George C. Chen, Member, IEEE, and Charles E. Persons


Abstract — In  ocean  surveillance,  a  number  of  different  types  of 
transient  signals  are  observed.  These  sonar  signals  are  waveforms 
in  one  dimension  (1-D).  The  hidden  Markov  model  (HMM) 
is  well  suited  to  classification  of  1-D  signals  such  as  speech 
[7],  [8].  In  HMM  methodology,  the  signal  is  divided  into  a 
sequence  of  frames,  and  each  frame  is  represented  by  a  feature 
vector.  This  sequence  of  feature  vectors  is  then  modeled  by 
one  HMM.  Thus,  the  HMM  methodology  is  highly  suitable  for 
classifying  the  patterns  that  are  made  of  concatenated  sequences 
of  micro  patterns.  The  sonar  transient  signals  often  display  an 
evolutionary  pattern  over  the  time  scale.  Following  this  intuition, 
the  application  of  HMM’s  to  sonar  transient  classification  is 
proposed and discussed in this paper. Toward this goal, three
different  feature  vectors  based  on  an  autoregressive  (AR)  model, 
Fourier  power  spectra,  and  wavelet  transforms  are  considered 
in  our  work.  In  our  implementation,  one  HMM  is  developed  for 
each  class  of  signals.  During  testing,  the  signal  to  be  recognized 
is  matched  against  all  models.  The  best  matched  model  identifies 
the  signal  class. 

The neural net (NN) classifier has been successfully used [2]-[4]
for sonar transient classification. The same set of features as
mentioned above is then used with a multilayer perceptron (MLP) NN
classifier. Some experimental results using “DARPA standard
data set I” with HMM and MLP-NN classification schemes are
presented.  Finally,  a  combined  NN/HMM  classifier  is  proposed, 
and  its  performance  is  evaluated  with  respect  to  individual 
classifiers. 


I.  Introduction 

The classification of transient sonar signals has been
widely studied [2]-[6]. The transient classification problem
is deemed difficult for a number of reasons: 1) the short
duration of the transients makes classical frequency analysis
difficult; 2) there are wide intraclass variations due to large variations
in the structures and systems generating the transients; and
3) the effects of ambient ocean noise and the presence of
biologics and merchant ships lead to poorly separated class
boundaries. The most common type of classifier used for this
task is the neural net [2]-[4], though other classifiers have
been studied [2], [5], [6]. Fourier power spectral coefficients
are widely used as feature vectors. Recently, the hidden
Markov model has been studied for sonar signal classification
[5], [6], [24]. In [5], [6], AR model parameters are used
as feature vectors for the HMM classifier. It is relevant to
note here that the HMM was originally introduced by the
speech community [7], [8]. In speech, the linear predictive
coefficients (LPC), i.e., AR coefficients, are successfully used
as the feature vector. However, sonar signals have their own
characteristics. It has been found that no single technique
can adequately capture all feature information for all ocean
acoustic transients of interest [2]-[6], [16]. So, it is expected
that  other  features  could  lead  to  more  interesting  results. 
With  this  view  in  mind,  we  have  experimented  with  the 
HMM  classifier  and  three  different  feature  vectors  in  this 
paper.  The  feature  vector  based  on  an  AR  model  is  a  natural 
candidate.  As  the  Fourier  power  spectrum  is  widely  used 
by the NN community for their research, these features are
also considered [4]. Finally, wavelet-transform-based features
are considered. Interestingly, some features based on specific
wavelet implementation have been used in [2], [3]. It is well
known  that  sonar  transients  are  nonstationary  signals.  The 
wavelet  transform  can  properly  represent  such  signals.  In 
particular,  Daubechies  type  wavelets  are  considered  in  our 
work.  These  wavelets  are  finite  duration  filters  and  quite 
easy  to  implement.  Besides,  these  wavelets  have  not  been 
tried  in  the  context  of  transient  sonar  signal  classification. 
It  is  our  viewpoint  that  these  three  very  different  signal 
representations  for  feature  extraction  would  reveal  some  of 
the  latent  characteristics  of  the  signal  for  better  classification. 

In  speech,  the  spoken  word  manifests  itself  as  a  left-to-right 
concatenation of phonemes [7], [8], the fundamental speech
unit. The states in HMM are identified with the phonemes. As
a result, a left-to-right HMM topology is often preferred in the
application of HMM to speech recognition. This argument, in
our view, may not hold in all applications of HMM to sonar
signal classification. We think of a particular sonar transient
as a macro pattern that has evolved as a sequence of micro
patterns. We identify the “states” with the “micro patterns.”
However, in the absence of any other a priori constraint, the
macro pattern may be composed of any sequential combination
of the basic micro patterns. In other words, a fully connected
HMM topology, where the transition from any state to any
other state is possible, could be more useful in such situations.
For the data set used in our experiment, the fully connected
HMM topology performs consistently better than the left-to-right
HMM topology. However, there are sonar signals
where the utility of left-to-right HMM topology has been
demonstrated [5].

Fig. 1. Block diagram representation of classification scheme. (Blocks: DARPA data set I; autoregressive, spectral, and wavelet coefficients; neural network and hidden Markov model classifiers; confusion matrices.)

Finally, we have studied the same set of features with
an MLP-NN classifier with the express objective of finding
out the complementary nature, if any, of these two classifiers,
MLP-NN and HMM. So far, a comprehensive study
involving NN and HMM and a number of feature sets has
not been undertaken. In a recent article, Miller et al. [1]
have  pointed  out  the  importance  of  exploring  alternative 
technologies  to  NN  in  order  to  make  comparative  performance 
measurements  and  to  obtain  the  best  possible  solutions  to 
signal  processing  and  classification  problems.  We  show  in  the 
current  paper  that  a  combined  classifier  using  HMM’s  and 
MLP-NN’s  is  likely  to  outperform  the  individual  classifiers.  It 
is  relevant  to  note  here  that  the  concept  of  a  combined  classifier 
for  robust  classification  is  well  known  in  pattern  recognition 
theory,  and  has  already  been  tried  with  other  classifiers  for 
transient  sonar  signal  classification  [2].  Fig.  1  gives  the  block 
diagram  of  our  scheme.  Note  that  there  is  no  preprocessing 
involved  in  our  system.  This  is  a  deliberate  decision.  The 
preprocessing  operations  are  often  quite  dependent  on  the 
given  signal.  Usually,  these  operations  try  to  enhance  or 
deemphasize  certain  aspects  of  the  given  signal  for  better 
classification.  A  potential  drawback  of  such  operations  is  that 
when  the  signal  classes  are  changed,  the  old  preprocessing 
schemes  are  often  invalid.  Thus,  to  design  an  automatic  system 
for  transient  signal  classification,  we  will  not  include  any 
preprocessing  operations.  This  design  without  preprocessing 
is  expected  to  make  our  system  suitable  for  a  wide  range  of 
sonar  transients. 

The  remaining  sections  of  this  paper  are  organized  as 
follows:  Section  II  describes  the  implementation  of  AR,  FFT- 
based,  and  wavelet-based  features.  Section  III  presents  the 
theory and implementation of HMM as applied to our classification
problem. In this section, some discussions on the
implementation of NN are also included. Section IV discusses
some  practical  considerations  in  implementation.  Section  V 
gives  the  detailed  experimental  results  using  DARPA  standard 
data  set  I.  Section  VI  summarizes  the  conclusions. 

II.  Feature  Representation 

As  described  in  Section  I,  we  have  three  different  feature 
representation schemes: one based on an autoregressive model,
one  based  on  Fourier  power  spectra,  and  the  other  based  on  the 
wavelet  transform.  In  this  section,  these  feature  representation 
schemes  are  briefly  discussed. 




A.  Autoregressive  Model 

In describing the AR model, we will use the notation $r(l)$
to denote the signal. The autoregressive model is a simple
prediction of the current signal value by a linear combination
of $M$ previous signal values plus a constant term and an error
term:

$$r(l) = \alpha + \sum_{j=1}^{M} \theta_j\, r(l-j) + \sqrt{\beta}\,\omega_l, \qquad l = 1, \ldots, L \tag{2.1}$$

where:

$r(l)$: current signal value; $r(l-j)$: previous signal values;
$\theta_j$: autoregressive coefficients to be estimated; $M$: model order;
$\alpha$: constant to be estimated; $\sqrt{\beta}$: constant to be estimated;
$\omega_l$: random number with zero mean and unit variance.

$\alpha$, $\theta_j$, and $\beta$ are the model parameters; $\beta$ is the variance
of the prediction noise and reflects the accuracy of the
prediction.

It is noted from (2.1) that to predict $r(l)$, we need $M$ initial
values of $r(l)$, i.e., $r(-M+1), r(-M+2), \ldots, r(0)$. It is
easy to derive that:

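$$\begin{bmatrix} \theta_1 \\ \vdots \\ \theta_M \\ \alpha \end{bmatrix} = \begin{bmatrix} R_{11} & \cdots & R_{1M} & S_1 \\ \vdots & & \vdots & \vdots \\ R_{M1} & \cdots & R_{MM} & S_M \\ S_1 & \cdots & S_M & L \end{bmatrix}^{-1} \begin{bmatrix} R_{01} \\ \vdots \\ R_{0M} \\ S_0 \end{bmatrix} \tag{2.2}$$

where

$$R_{ij} = R_{ji} = \sum_{l=1}^{L} r(l-i)\, r(l-j), \qquad i, j = 1, \ldots, M \tag{2.3}$$

(the entries $R_{0j}$ on the right-hand side of (2.2) follow from the same sum with $i = 0$),

$$S_i = \sum_{l=1}^{L} r(l-i), \qquad i = 0, 1, \ldots, M \tag{2.4}$$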

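and

$$\beta = \frac{1}{L} \sum_{l=1}^{L} \Bigl[ r(l) - \alpha - \sum_{j=1}^{M} \theta_j\, r(l-j) \Bigr]^2 \tag{2.5}$$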
Fig. 2. An example of the different classes of signals used in our experiment.


If $\alpha$ is zero, (2.2) takes the form of the Yule-Walker equations.
The $R$ matrix is then Hermitian and Toeplitz [19].
A straightforward approach is to replace the autocorrelation
functions in $R$ by sample autocorrelations. Often, a more
efficient algorithm, known as Burg’s algorithm, is used to
compute the AR coefficients $\theta_j$. For details regarding Burg's algorithm, please
see [19]. It should be noted that the choice of optimal model
order,  M,  is  application-dependent  and  is  usually  determined 
empirically. 
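
As an illustration of (2.1) and (2.2), the following Python sketch fits the AR parameters by solving the least-squares normal equations directly. It is not Burg's algorithm. The function names `fit_ar_least_squares` and `solve_linear`, and the convention that the first M samples of the input act as the initial values, are assumptions made for this example only.

```python
def solve_linear(G, g):
    """Solve G x = g by Gaussian elimination with partial pivoting."""
    n = len(g)
    A = [row[:] + [rhs] for row, rhs in zip(G, g)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(col + 1, n):
            f = A[k][col] / A[col][col]
            for c in range(col, n + 1):
                A[k][c] -= f * A[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (A[i][n] - tail) / A[i][i]
    return x


def fit_ar_least_squares(r, M):
    """Fit r(l) = alpha + sum_j theta_j r(l-j) + noise by least squares.

    Assumption: the first M entries of r are the initial values, so
    L = len(r) - M samples are actually predicted.
    Returns (theta, alpha, beta); beta is the mean squared residual.
    """
    L = len(r) - M
    if L <= M + 1:
        raise ValueError("need more samples than parameters")
    # Each regressor row is [r(l-1), ..., r(l-M), 1].
    rows = [[r[n - j] for j in range(1, M + 1)] + [1.0]
            for n in range(M, len(r))]
    targets = [r[n] for n in range(M, len(r))]
    P = M + 1
    # G holds the R_ij, S_i and L entries of (2.2); g holds R_0i and S_0.
    G = [[sum(row[a] * row[b] for row in rows) for b in range(P)]
         for a in range(P)]
    g = [sum(row[a] * y for row, y in zip(rows, targets)) for a in range(P)]
    p = solve_linear(G, g)
    theta, alpha = p[:M], p[M]
    resid = [y - sum(c * x for c, x in zip(p, row))
             for row, y in zip(rows, targets)]
    beta = sum(e * e for e in resid) / L
    return theta, alpha, beta
```

On a noise-free sequence generated by a known second-order recursion, this sketch should recover the generating coefficients up to rounding error; on real data the residual variance gives a rough indication of how well the chosen model order fits.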

B.  Fourier  Power  Spectrum 

From  the  given  data  segment,  its  FFT  is  computed.  Before 
FFT  computation,  each  data  segment  is  windowed  with  a 
Kaiser-Bessel  window  function.  The  magnitude  square  of  the 
FFT  coefficients  gives  the  Fourier  power  spectrum  of  the 
data. 
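
The windowing and power-spectrum computation can be sketched as follows. This is an illustrative sketch only: it uses a direct O(n²) DFT instead of an FFT to stay dependency-free, and the window shape parameter `beta=8.0` and all function names are assumptions, since the text does not state the parameter used.

```python
import cmath
import math


def bessel_i0(x, terms=30):
    """Zeroth-order modified Bessel function from its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / 2.0) / k      # term = (x/2)^k / k!
        total += term * term
    return total


def kaiser_window(n, beta=8.0):
    """Kaiser-Bessel window of length n (beta is an assumed value)."""
    if n == 1:
        return [1.0]
    window = []
    for i in range(n):
        ratio = 2.0 * i / (n - 1) - 1.0
        arg = beta * math.sqrt(max(0.0, 1.0 - ratio * ratio))
        window.append(bessel_i0(arg) / bessel_i0(beta))
    return window


def power_spectrum(segment, beta=8.0):
    """Squared DFT magnitudes of the windowed segment.

    Only the first n // 2 bins are returned, since the remaining
    bins of a real-valued segment mirror them.
    """
    n = len(segment)
    window = kaiser_window(n, beta)
    x = [s * w for s, w in zip(segment, window)]
    spectrum = []
    for k in range(n // 2):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum
```

For a production system, an FFT routine would replace the inner DFT loop; the output layout (n/2 power values per real segment of length n) stays the same.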


C.  Wavelet  Transform 

In  the  short-time  Fourier  transform,  time  and  frequency 
resolutions are fixed. Because of Heisenberg's principle, the
time and frequency resolution product cannot be better than a
threshold ($1/4\pi$). In the wavelet representation, it is possible
to  achieve  high  time  resolution  at  the  cost  of  frequency 
resolution,  and  vice  versa.  This  is  easily  demonstrated  by 
making  the  frequency  resolution  proportional  to  frequency. 
In  the  wavelet  transform,  this  compromise  leads  to  very  high 
time  resolution  for  high-frequency  signals,  and  high-frequency 
resolution  for  low-frequency  signals.  Since  the  sonar  signals 
almost  always  display  an  evolving  frequency  profile  with 
time, wavelet transform representation is philosophically very
appealing. 

In the wavelet transform, the transform space is defined
by the basis functions, which are all derived from one basic
wavelet via scaling and translation; i.e., if $h(t)$ is the basic
wavelet and $h_{a,\tau}(t)$ is the generic wavelet basis function, then
[20], [21]:

$$h_{a,\tau}(t) = \frac{1}{\sqrt{a}}\, h\!\left(\frac{t-\tau}{a}\right).$$

When $a$ and $\tau$ are continuous, there are infinite possibilities
for $a$ and $\tau$. Consequently, the transformation of a signal $x(t)$
using these basis functions, and subsequent reconstruction of
$x(t)$ from the transform, is a simpler task. The interesting
task is to appropriately discretize the time-scale parameters $a$
and $\tau$ such that a true orthonormal basis function is obtained.
The  solution  depends  on  the  choice  of  wavelet  h{t).  So,  our 
problem  is: 

$$\text{discretize:} \quad h_{j,k}(t) = a_0^{-j/2}\, h\!\left(a_0^{-j} t - kT\right)$$

where  T  is  the  sampling  period  of  the  discrete  signal.  Of 
course, if we choose $a_0 \approx 1$ and $T$ small, we are close to the
continuous case. For implementation advantage, our interest is
in the dyadic wavelet that has $a_0 = 2$, i.e.,

$$h_{j,k}(t) = 2^{-j/2}\, h\!\left(2^{-j} t - kT\right)$$

where  j  and  k  belong  to  the  set  of  natural  numbers.  The 
Daubechies  wavelets  are  a  class  of  discrete  orthonormal  dyadic 
wavelets. An $M$-order Daubechies wavelet [20] is given by
$M$ coefficients denoted by $c_j$, $j = 0, \ldots, M-1$. Then,
the convolution of the signal with an FIR filter of length $M$
($c_j$, $j = 0, \ldots, M-1$) gives the smooth component. On
the other hand, the convolution of the signal with an FIR
filter of length $M$ and coefficients $(-1)^j c_{M-1-j}$, $j = 0, \ldots, M-1$,
gives the detail component. After one pass of this
algorithm,  the  smooth  and  detail  components  are  decimated 
by  2.  The  smooth  components  are  then  transformed  again, 
and  the  procedure  continues  until  we  have  only  two  smooth 
components  left.  The  output,  at  this  stage,  is  the  wavelet 
transform  of  the  original  signal.  The  coefficients  in  Daubechies 
wavelets  are  obtained  from  orthonormality  conditions  and 
“smoothness  constraints.”  For  an  M-order  wavelet,  these 
conditions  and  constraints  lead  to  exactly  M  linear  equations. 
Thus,  M  coefficients  are  uniquely  determined.  For  more 
discussions  and  details  about  the  coefficients,  see  [20],  [21]. 
For  an  excellent  exposition  related  to  theory  and  applications 
of  wavelet  transforms,  please  see  [23]. 
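
The pyramid procedure described above can be sketched for the four-coefficient Daubechies wavelet as follows. The coefficient values are the standard Daubechies-4 filter taps; the periodic wrap-around at the signal boundary, the power-of-two length requirement, and the function names are assumptions of this sketch, not details given in the text.

```python
import math

SQ3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Smooth (low-pass) filter taps c_0..c_3 of the Daubechies-4 wavelet.
C = [(1 + SQ3) / NORM, (3 + SQ3) / NORM, (3 - SQ3) / NORM, (1 - SQ3) / NORM]
# Detail (high-pass) filter taps: d_j = (-1)^j * c_{M-1-j}.
D = [(-1) ** j * C[len(C) - 1 - j] for j in range(len(C))]


def dwt_step(x):
    """One pass: filter with C and D, keeping every second output."""
    n = len(x)
    half = n // 2
    smooth = [sum(C[j] * x[(2 * i + j) % n] for j in range(4))
              for i in range(half)]
    detail = [sum(D[j] * x[(2 * i + j) % n] for j in range(4))
              for i in range(half)]
    return smooth, detail


def dwt(x):
    """Repeat dwt_step on the smooth part until two smooth values remain.

    Output layout: [2 smooth values, coarsest details, ..., finest details].
    """
    n = len(x)
    if n < 4 or n & (n - 1):
        raise ValueError("length must be a power of two and at least 4")
    out = list(x)
    while n >= 4:
        smooth, detail = dwt_step(out[:n])
        out[:n] = smooth + detail
        n //= 2
    return out
```

Because the transform is orthonormal under these assumptions, the sum of squared outputs equals the sum of squared inputs, which gives a quick sanity check on any implementation.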

D.  Feature  Selection 

The feature representation schemes described so far transform
the original signal into feature space. Since some features
may be more useful than others, only the important features
should be selected for a compact representation of the signal
for classification purposes. This is a necessary data reduction
stage. The idea behind this stage is that only a few features
can  discriminate  one  class  from  the  others. 

In  our  scheme,  the  signal  is  divided  into  a  number  of 
overlapping frames. For the AR-model-based feature representation,
the AR coefficients are taken as the feature vector. Since
relatively few AR coefficients are needed to represent a frame,
AR feature representation is already in very compact form. For
the FFT power spectrum and the wavelet transform, the spectral
and transform coefficients with relatively higher magnitude
are selected as features. For instance, 256-point real data
will give 128 distinct FFT power spectral coefficients. For a
particular signal class, all such FFT power spectral coefficients
are analyzed, and the top few, say $L$ of them, in terms of
magnitude, are selected as features for that signal class. The
union  of  all  individual  feature  sets,  each  one  belonging  to 
one  signal  class,  gives  the  global  feature  vector  set  for  all 
signals.  A  similar  procedure  is  used  to  select  the  feature  vector 
from  wavelet  transform  coefficients.  The  details  regarding  the 
number  of  frames  in  a  signal  template,  the  percentage  of 
overlap  among  successive  frames,  and  the  number  of  features 
in  each  feature  vector  are  described  in  Section  V. 
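
The per-class selection and union described in this subsection might look like the sketch below. Ranking coefficients by their mean absolute value over the training frames of a class is an assumption; the text only says the coefficients are "analyzed" and ranked by magnitude. All names are hypothetical.

```python
def top_indices(mean_magnitudes, L):
    """Indices of the L largest entries."""
    order = sorted(range(len(mean_magnitudes)),
                   key=lambda k: mean_magnitudes[k], reverse=True)
    return set(order[:L])


def global_feature_indices(class_to_vectors, L):
    """Union over classes of each class's top-L coefficient indices.

    class_to_vectors maps a class label to a list of coefficient
    vectors (one per training frame), all of the same length.
    """
    selected = set()
    for vectors in class_to_vectors.values():
        n_coef = len(vectors[0])
        # Assumed ranking statistic: mean absolute coefficient value.
        mean_mag = [sum(abs(v[k]) for v in vectors) / len(vectors)
                    for k in range(n_coef)]
        selected |= top_indices(mean_mag, L)
    return sorted(selected)


def extract_features(vector, indices):
    """Keep only the globally selected coefficients of one frame."""
    return [vector[k] for k in indices]
```

The size of the resulting feature vector is at most L times the number of classes, and smaller when classes share dominant coefficients.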

III. Classifier Design

In our work, we have used two classifiers: HMM and
MLP-NN. In this section, these two classifiers are described.

A.  Continuous  Density  HMM 

A first-order $N$-state Markov chain is defined by an $N \times N$
state transition probability matrix $A$ and an $N \times 1$ initial
probability vector $\Pi$, where:

$A = \{a_{ij}\}$, $a_{ij} = \Pr(q_{t+1} = j \mid q_t = i)$, $i, j = 1, 2, \ldots, N$

$\Pi = \{\pi_i\}$, $\pi_i = \Pr(q_1 = i)$, $i = 1, 2, \ldots, N$

$Q = \{q_t\}$: state sequence, $q_t \in \{1, 2, \ldots, N\}$, $t = 1, 2, \ldots, T$

$N$: number of states
$T$: length of state sequence.

By definition, $\sum_{j=1}^{N} a_{ij} = 1$ for $i = 1, 2, \ldots, N$ and
$\sum_{i=1}^{N} \pi_i = 1$. A state sequence $Q$ is a realization of the Markov
chain with probability:

$$\Pr(Q \mid A, \Pi) = \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t} \tag{3.1}$$
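
As a small worked illustration of (3.1), the sketch below multiplies the initial probability by the successive transition probabilities. States are indexed from 0 here (an assumption made for Python convenience), and the function name is hypothetical.

```python
def state_sequence_probability(A, pi, Q):
    """Probability of state sequence Q under a first-order Markov chain.

    A[i][j] is the transition probability from state i to state j,
    pi[i] is the initial probability of state i (0-based states).
    """
    p = pi[Q[0]]
    for prev, cur in zip(Q, Q[1:]):
        p *= A[prev][cur]
    return p
```

For example, with `A = [[0.9, 0.1], [0.5, 0.5]]` and `pi = [0.6, 0.4]`, the sequence `[0, 0, 1, 1]` has probability 0.6 × 0.9 × 0.1 × 0.5 = 0.027.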

A  hidden  Markov  model  is  a  Markov  chain  whose  states 
cannot  be  observed  directly,  but  can  be  observed  through  a 
sequence  of  observation  vectors  [7].  Each  observation  vector, 
also called a symbol, is a manifestation of a state through certain
probability distributions. In other words, each observed
vector is generated by an underlying state with an associated
probability distribution. For solving our problem, we will
consider only the observations with continuous probability
density. A continuous density HMM is characterized by the
state transition probability $A$, the initial state probability $\Pi$,
and an $N \times 1$ observation density or symbol probability density
vector $B$, where:

$B = \{b_j(o_t)\}$,

$b_j(o_t)$: a posteriori density of observation $o_t$ given $q_t = j$

$O = \{o_t\}$: observation sequence, $t = 1, 2, \ldots, T$.

In many practical problems, it is reasonable to assume
that the observation density is Gaussian. In this case, the
density is completely specified by the mean and covariance
of $o_t$, i.e., $b_j(o_t) = N(\mu_j, V_j)$, where $\mu_j$ and $V_j$ are the
conditional mean vector and the conditional covariance matrix,
respectively, of $o_t$ given state $q_t = j$. It should be noted
that,  in  our  application,  the  states  of  an  HMM  may  not 
have  specific  physical  meaning.  They  may  just  reflect  some 
clustering  properties  of  the  observation  vectors  in  the  feature 
space. 

We can more compactly denote the parameter set by
$\lambda = (A, \Pi, B)$. Then, an HMM is completely specified by $\lambda$. Three
problems associated with HMM are of our concern:

1)  Based  on  what  optimization  criterion  should  our  model 
be  built? 

2) Given the model and an observation sequence
$O = \{o_1, o_2, \ldots, o_T\}$, how can we classify the observation efficiently?
This is the classification problem.

3) Given a number of observation sequences of a known
class, how can we obtain the optimal model estimate $\lambda$? This
is the training problem.

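Problem 1 (Optimization Criterion): Suppose we are given
a model $\lambda$ and an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$.
Then, the density function of $O$ is given by:

$$p(O \mid \lambda) = \sum_{Q} \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t) \tag{3.2}$$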

A direct choice of optimization criterion is the maximum
likelihood criterion that maximizes $p(O \mid \lambda)$. The estimation
of the parameters by this criterion can be solved using the
Baum-Welch reestimation algorithm [7]. The algorithm is an
iterative procedure that guarantees a monotonic increase of the
likelihood function for a given set of training samples.

Another optimization criterion is to maximize the state-optimized
likelihood function defined by:

$$p(O, Q^* \mid \lambda) = \max_{Q}\, p(O, Q \mid \lambda) = \max_{Q}\, \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t) \tag{3.3}$$

where $Q^* = \{q_1^*, q_2^*, \ldots, q_T^*\}$ is the optimal state sequence
associated with the state-optimized likelihood function, and $q_t^*$
is the $t$th state in this optimal state sequence. Equation (3.3)
is the density of the optimal or most likely state sequence
path among all possible paths. The estimation of the parameters
using this criterion is given by the segmental $k$-means
algorithm [9], [10]. This algorithm is an iterative procedure
that guarantees, under some conditions described later, the
monotonic increase of the state-optimized likelihood function
for a given set of training samples.

Comparing (3.2) and (3.3), we find that (3.2) involves
computation along all possible state paths, while (3.3) tracks
only the most likely path. Therefore, the computation required
by (3.3) is much less than that of (3.2). Also, since $b_{q_t}(o_t)$
often has a large dynamic range, overflow or underflow is more
likely to happen in evaluation of (3.2) than in evaluation of
(3.3). Furthermore, in a particular application, if the data fit the
model very well, all the observation samples of one class are
likely to have few dominant state sequences. It means that the
optimal state sequence in (3.3) carries a lot of information that
may discriminate one class from another. For these reasons,
we have chosen maximization of (3.3) as our criterion.
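
To make the difference between the two criteria concrete, the following sketch evaluates both (3.2) and (3.3) by brute-force enumeration of every state path. This is only practical for tiny models, since there are N^T paths; it is meant as an illustration, not as a way to evaluate either quantity in practice. The table `B[j][t]`, holding the density value of the t-th observation under state j, and the 0-based state indexing are assumptions of the sketch.

```python
from itertools import product


def path_density(A, pi, B, Q):
    """Joint density of the observations and one state path Q.

    B[j][t] is the density of the t-th observation under state j.
    """
    p = pi[Q[0]] * B[Q[0]][0]
    for t in range(1, len(Q)):
        p *= A[Q[t - 1]][Q[t]] * B[Q[t]][t]
    return p


def total_and_best(A, pi, B):
    """Return (sum over all paths, best single-path value, best path)."""
    N, T = len(pi), len(B[0])
    total, best, best_path = 0.0, 0.0, None
    for Q in product(range(N), repeat=T):
        p = path_density(A, pi, B, Q)
        total += p            # contributes to (3.2)
        if p > best:          # candidate for (3.3)
            best, best_path = p, Q
    return total, best, best_path
```

The best-path value can never exceed the all-path sum, and when one path dominates the two are close, which is the situation the text describes for data that fit the model well.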

Problem 2 (Classification): To solve our signal classification
problem, we create one HMM for each class. For a
classifier of $P$ classes, we denote the $P$ models by $\lambda_p$,
$p = 1, 2, \ldots, P$. When a signal $O$ of unknown class is given,
we calculate:

$$p^* = \arg\max_{p}\, p(O, Q^* \mid \lambda_p) \tag{3.4}$$

and  classify  the  signal  as  belonging  to  class  p*. 

Now  we  can  immediately  see  one  of  the  advantages  of 
HMM.  The  model  for  one  class  is  independent  of  the  model  for 
any other class, i.e., the training for one class is not related to
the  training  for  any  other  class.  It  follows  that  when  a  new  class 
is  added  to  the  classifier,  we  need  only  to  train  for  this  new 
class,  but  do  not  have  to  retrain  for  any  other  class.  In  general, 
this  advantage  is  not  associated  with  a  neural  net  classifier. 

For a given $\lambda$, an efficient method to find $p(O, Q^* \mid \lambda)$ is the
well-known Viterbi algorithm [11], [12] as described below.

Viterbi Algorithm

Step 1. Initialization

For $1 \le i \le N$,

$$\delta_1(i) = \pi_i\, b_i(o_1) \tag{3.5}$$

$$\psi_1(i) = 0. \tag{3.6}$$

Step 2. Recursive computation

For $2 \le t \le T$, for $1 \le j \le N$,

$$\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\, b_j(o_t) \tag{3.7}$$

$$\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]. \tag{3.8}$$

Step 3. Termination

$$P^* = \max_{1 \le i \le N} [\delta_T(i)] \tag{3.9}$$

$$q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)]. \tag{3.10}$$

Step 4. Tracing back the optimal state sequence

For $t = T-1, T-2, \ldots, 1$,

$$q_t^* = \psi_{t+1}(q_{t+1}^*). \tag{3.11}$$

$P^*$ is the state-optimized likelihood function, and
$Q^* = \{q_1^*, q_2^*, \ldots, q_T^*\}$ is the optimal state sequence.

In practice, as $t$ increases, the value of $\delta_t(j)$ could be very
large or very small so that an overflow or underflow may occur
during computation on a computer. To avoid this problem,
we take the logarithm of all probabilities and densities, and
replace all multiplications by additions. Obviously, the result
of tracing the optimal state sequence is not affected by this
modification. If any particular value is zero, we set it to a very
small number such that it does not affect the result.
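
A log-domain version of the four steps, together with the decision rule (3.4), might be sketched as follows. The table `B[j][t]` of precomputed density values, the 0-based state indexing, the floor value `1e-300` used in place of zero, and all function names are assumptions of this sketch.

```python
import math


def safe_log(p, floor=1e-300):
    """Logarithm with zero replaced by a very small number."""
    return math.log(max(p, floor))


def viterbi_log(A, pi, B):
    """Return (log P*, optimal state path) for one model.

    A[i][j]: transition probability, pi[i]: initial probability,
    B[j][t]: density of the t-th observation under state j.
    """
    N, T = len(pi), len(B[0])
    # Step 1: initialization.
    delta = [safe_log(pi[i]) + safe_log(B[i][0]) for i in range(N)]
    psi = []  # psi[t - 1][j]: best predecessor of state j at time t
    # Step 2: recursion, with products replaced by sums of logs.
    for t in range(1, T):
        new_delta, back = [], []
        for j in range(N):
            scores = [delta[i] + safe_log(A[i][j]) for i in range(N)]
            i_best = max(range(N), key=lambda i: scores[i])
            back.append(i_best)
            new_delta.append(scores[i_best] + safe_log(B[j][t]))
        delta = new_delta
        psi.append(back)
    # Step 3: termination.
    q_last = max(range(N), key=lambda i: delta[i])
    log_p_star = delta[q_last]
    # Step 4: trace back the optimal state sequence.
    path = [q_last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    path.reverse()
    return log_p_star, path


def classify(models):
    """Pick the class whose model gives the largest log P*.

    models maps a class label to (A, pi, B), where B has been
    evaluated on the same test signal for every model.
    """
    return max(models, key=lambda label: viterbi_log(*models[label])[0])
```

Because each model is scored independently, adding a class only adds one entry to `models`, which mirrors the independence property noted under Problem 2.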

Problem 3 (Training): In creating the model for each class,
we should guarantee that the parameters we obtain are the
optimum for a given set of training samples. Since our decision
rule is the state-optimized likelihood function, it requires
that the estimated parameter $\lambda$ be such that $p(O, Q^* \mid \lambda)$ is
maximized for the training set. It is shown in [9] that the
segmental $k$-means algorithm [10] converges to the state-optimized
likelihood function for a wide range of observation
density functions, including the Gaussian density we have
assumed. The algorithm is described below.

1) Cluster all training vectors into $N$ clusters using the minimum
distance rule with random initial clustering centroids.
Each cluster is chosen as a state and numbered from 1 to $N$.
The $t$th vector, $o_t$, of a training sequence $O$ is assigned to state
$i$, denoted as $o_t \in i$, if its distance to state $i$ is smaller than its
distance to any other state $j$, $j \ne i$. The distance measure we
have used is the unweighted Euclidean distance. This step is to
get a good initialization for the complete training procedure.

2) Calculate the mean vector and covariance matrix for each
state. For $1 \le i \le N$,

$$\hat{\mu}_i = \frac{1}{N_i} \sum_{o_t \in i} o_t$$

$$\hat{V}_i = \frac{1}{N_i} \sum_{o_t \in i} (o_t - \hat{\mu}_i)^{\top} (o_t - \hat{\mu}_i) \tag{3.14}$$

where $N_i$ is the number of vectors assigned to state $i$.

3) Calculate the transition and initial probabilities. For
$1 \le i \le N$,

$$\hat{\pi}_i = \frac{\text{Number of occurrences of } \{o_1 \in i\}}{\text{Number of training sequences}} \tag{3.15}$$

For $1 \le i \le N$ and $1 \le j \le N$,

$$\hat{a}_{ij} = \frac{\text{Number of occurrences of } \{o_t \in i \text{ and } o_{t+1} \in j\} \text{ for all } t}{\text{Number of occurrences of } \{o_t \in i\} \text{ for all } t} \tag{3.16}$$

4) Calculate density functions of each training vector for
each state. For $1 \le j \le N$,

$$\hat{b}_j(o_t) = \frac{1}{(2\pi)^{M_1/2}\, |\hat{V}_j|^{1/2}} \exp\!\left[ -\frac{1}{2} (o_t - \hat{\mu}_j)\, \hat{V}_j^{-1}\, (o_t - \hat{\mu}_j)^{\top} \right]. \tag{3.17}$$

Here $M_1$ is the dimension of the feature vector.

5) Use the Viterbi algorithm and the new probabilities to
trace the optimal state sequence $Q^*$ for each training sequence.
A vector is reassigned a state if its original state assignment is
different from the tracing result, i.e., assign $o_t \in i$ if $q_t^* = i$.

6) If any vector is reassigned a new state in Step 5, use
the new state assignment and repeat Step 2 through Step 5;
otherwise, stop.

B. Multilayer Perceptrons

Multilayer perceptrons (MLP) are feedforward nets with one
or more layers of nodes between the input and output layers.
The lowest layer is the input layer, which does not have any
processing capability. The highest layer is the output layer,
and any layer between the input and output layers is called
a hidden layer. All the nodes in a layer are connected to the
nodes in the layer above it, and there is no connection within
a layer or from a higher layer. For example, a three-layer
perceptron is shown in Fig. 4. The perceptron processing unit
performs a weighted sum of its input values $u_i$ and passes the
result through a function $f(\cdot)$,
where $\{w_{ij}\}$, $\{w_{jk}\}$ are the weight matrices and $f(\cdot)$ is usually
a nonlinear function such as the sigmoid function

$$f(x) = 1/(1 + e^{-x}).$$

Generally, the multilayer perceptrons are trained with the
error backpropagation (EBP) algorithm [15], which is an iterative
gradient algorithm designed to minimize the mean
square error (MSE) between the desired output and the actual
output $y_k$. Sometimes, a momentum term is also included in
the training procedure. The details of this algorithm can be
found in [14]. In addition to MLP’s, other neural networks
such as Kohonen feature maps have also been used in pattern
recognition. A good introduction to the neural nets is given by
Lippmann [13], and an excellent exposition of the NN is given
by Hecht-Nielsen [17]. For a useful survey on NN’s and their
foundations, paradigms, applications, and implementations,
see [18], [22].

IV. Implementation Considerations



A.  Training  of  Classifiers 

Each  signal  template,  i.e.,  exemplar,  is  divided  into  a 
sequence  of  partially  overlapping  segments.  Each  segment  is 
then represented by one feature vector. The sequence of feature
vectors is used as one training/testing observation sequence for
the HMM. For the MLP-NN, the whole sequence of feature
vectors is used as the training vector. For example, if there
are 20 four-dimensional vectors in the sequence, these 80
features are used as training features for the MLP-NN, and the
MLP-NN is designed with 80 input nodes. The MLP-NN has
one hidden layer, and it is trained using the backpropagation
algorithm and sigmoidal nonlinearity. The HMM's are trained
using the segmental $k$-means algorithm [11] as described in the
previous section. For each signal class, one HMM is designed.
During recognition, the test signal is matched against all
models to find the best match. The matching is done by the
Viterbi algorithm [11], [12]. Fig. 3 depicts the implementation
of the HMM. For more details about the number of points in each
signal template, the number of frames in a signal template,
the percentage of overlap among successive frames, and the
number of signal classes, refer to Section V.
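
The segmentation of a template into overlapping frames, and the flattening of the feature-vector sequence into a single MLP input, can be sketched as below. The fractional `overlap` parameter, the rounding of the hop size, and the function names are assumptions for illustration; the actual frame length and overlap are given in Section V.

```python
def split_into_frames(signal, frame_len, overlap):
    """Split a signal into partially overlapping frames.

    overlap is the fraction of a frame shared with its successor,
    in the range [0, 1). Trailing samples that do not fill a whole
    frame are dropped.
    """
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def flatten(feature_vectors):
    """Concatenate a sequence of feature vectors into one MLP input."""
    return [value for vector in feature_vectors for value in vector]
```

With this layout, the HMM consumes the list of per-frame feature vectors as an observation sequence, while the MLP-NN consumes the flattened list, so 20 four-dimensional vectors become 80 inputs.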

Fig. 3. Implementation of the HMM classifier. (a) Training phase. (b) Testing.

Fig. 4. A three-layer perceptron.


The  AR  coefficients  are  computed  using  Burg’s  algorithm. 
The  gain  coefficient  is  not  used  due  to  a  scaling  problem. 
The range of the “gain” coefficient is much larger than that of
the AR coefficients. The gain is given by $\sqrt{\beta}$ [see (2.1)]. There
are  sophisticated  techniques  to  overcome  this  problem  and  get 
better  results.  For  instance,  in  [5]  a  product-code  HMM  is  used 
that  can  take  the  gain  coefficient  into  account.  Daubechies-4 
and  Daubechies-20  wavelet  coefficients  are  used  to  compute 
the wavelet transform. As explained in Section II, for each
signal, only a few wavelet coefficients with high magnitude
are used as features. Similarly, for each signal, only a few
FFT power spectral coefficients with high magnitude are used
as features. The union of the coefficients for all different signal
types constitutes the feature vector.

B.  Initial  Clustering  Center  and  Local  Maxima  for  HMM 

In Section III, we have assumed that the feature vectors
have a normal distribution within each state. The global
convergence property of the segmental K-means algorithm
is based on this assumption. Although this is a practical and
reasonable assumption, when the number of training samples
is not sufficiently large, the data may not conform to this
assumption very well. A better solution to this problem is
to replace the Gaussian density by a mixture of Gaussian
densities, but this would greatly increase the complexity of the
model and therefore be computationally very costly. If
we do not change our model but carefully choose the initial
cluster centers in Step 1 of the training algorithm (Section
III) [11], we may still reach the global maximum. Thus, in
the training procedure, we try different initial cluster
centers and select the set of parameters that results in the
largest average P* over all training samples of that class.
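The restart strategy just described can be sketched as a small driver loop. The trainer and scorer are passed in as hypothetical callables (`train_fn` standing in for segmental K-means training from given initial centers, `score_fn` for the Viterbi log-score of one sequence); drawing the initial centers at random from the training frames is also an assumption of this sketch.

```python
import numpy as np

def select_best_initialization(train_fn, score_fn, sequences, n_states,
                               n_restarts=10, seed=0):
    """Train from several initial cluster centers and keep the best model.

    train_fn(sequences, init_centers) -> params   (hypothetical trainer)
    score_fn(sequence, params) -> float           (hypothetical scorer)
    The model kept is the one with the largest average score over the
    training sequences of the class.
    """
    rng = np.random.default_rng(seed)
    frames = np.vstack(sequences)
    best_params, best_avg = None, -np.inf
    for _ in range(n_restarts):
        # Pick n_states distinct training frames as initial cluster centers.
        idx = rng.choice(len(frames), size=n_states, replace=False)
        params = train_fn(sequences, frames[idx])
        avg = float(np.mean([score_fn(seq, params) for seq in sequences]))
        if avg > best_avg:
            best_params, best_avg = params, avg
    return best_params, best_avg
```

Restarts do not guarantee the global maximum; they only raise the chance that at least one run starts in its basin of attraction.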

V.  Experimental  Results 
A.  Signal  Description 

We have used DARPA standard data set I for our experiments.
This data set provides seven classes of signals to test
our  algorithm.  A  typical  example,  one  from  each  class,  is 
shown  in  Fig.  2.  We  denote  these  signal  classes  as: 

Class A: Broadband 15-ms pulse
Class B: Two 4 ms pulses, 27 ms separation
Class C: 3 kHz tonal, 10 ms duration
Class D: 3 kHz tonal, 100 ms duration
Class E: 150 Hz tonal, 1 s duration
Class F: 250 Hz tonal, 8 s duration
Class N: Ocean ambient noise.

TABLE I
Confusion Matrix, HMM Classifier, and AR Coefficient Features (Sixth-Order)
Rows: true class; columns: chosen class (A, B, C, D, E, F, N).

• 6th-order AR model
• 6-state HMM; 8-state is worse
• Recognition accuracy = 73.3%
• AR model may not be a good fit for this data
• AR model has problem modeling a pure sinusoid

TABLE II
Confusion Matrix, HMM Classifier, and AR Coefficient Features (Tenth-Order)
Rows: true class; columns: chosen class (A, B, C, D, E, F, N).

• 10th-order AR model
• 6-state HMM
• Recognition accuracy = 67.5%

We have created 45 templates, i.e., exemplars, for each
class, of which 23 are used as training templates and 22
as test templates. Each signal template contains 1024 data
points. The sampling rate for the signal is 24.576 kHz. For
this sampling rate, 1024 data points are enough to capture
the essential characteristics of all transient types, including the
Class B type signal, which has the most time spread. This
1024-point signal template is divided into 21 frames of 256
data points with an overlap of 218 points (approximately 85%)
between two successive frames. Once the feature vectors are
computed from each frame, the signal template is represented
by a feature vector sequence. The training/testing sets include
exemplars from five different SNR groups. The lowest SNR
is 24 dB down with respect to the highest SNR. The first
group is the reference, i.e., 0 dB, group. The other groups are
created by adding background noise to this reference group, and
the SNR values for these groups are -6, -12, -18, and -24
dB, respectively, with respect to the reference 0 dB group.
The SNR is computed as the ratio of the peak signal power to
background noise power expressed in dB. As a result, some
very noisy exemplars are included in our experiments. Most
classifiers can handle signals with relatively high SNR quite
well, but fail with low-SNR signals. A meaningful evaluation
of a classifier is possible only when the classifier can classify
low-SNR exemplars with high accuracy. Another important
distinction in our experiment is that we have included ocean
noise as a separate class. In DARPA data set I, ambient noise
has a frequency spectrum that substantially overlaps with that
of types A, B, E, and D signals. Thus, the inclusion of ambient
noise as a separate class makes our classification problem
much harder.
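As a check on the framing figures quoted in the signal description (1024-point templates, 256-point frames, 218-point overlap), the short sketch below reproduces the frame count and the peak-power SNR definition. The function names are hypothetical, and dropping the few samples left over after the last full frame is an assumption of this sketch.

```python
import numpy as np

def frame_signal(x, frame_len=256, overlap=218):
    """Split a 1-D signal into overlapping frames.

    The hop between frame starts is frame_len - overlap (38 samples here).
    Samples left over after the last full frame are dropped (assumed).
    """
    hop = frame_len - overlap
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def peak_snr_db(signal, noise):
    """Ratio of peak signal power to mean background noise power, in dB."""
    return 10.0 * np.log10(np.max(signal**2) / np.mean(noise**2))
```

With the default parameters a 1024-point template yields 1 + (1024 - 256) // 38 = 21 frames, and the overlap fraction is 218/256, or about 85%, matching the figures in the text.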

We have tried different numbers of states for the HMM, from
N = 2 to N = 12, and different numbers of nodes, from
10 to 30, in the hidden layer of the MLP-NN. We have also
compared the results of AR models of different orders, from
M = 2 to M = 10. Only the best results are reported in the
paper and the accompanying tables. Table I shows the number
of errors in classifying the total 154 test exemplars using AR
features and HMM's. The best result is obtained with the
six-state HMM and sixth-order AR model. As stated before,
the gain coefficient is not used, mainly because of the scale
problem. Table II gives the result of the same experiment with
a tenth-order AR model. The recognition accuracy, i.e., percentage
of correctly classified test exemplars, in both these experiments
is rather poor. This poor showing of AR-model-based features
could be attributed to two possible explanations: 1) the AR
model may not be a good fit for DARPA data set I; 2) the
AR model implemented with Burg's algorithm has a problem
modeling a pure sinusoid [19]. It can be clearly seen that the
AR model has great difficulty in discriminating type E and type
F signals, two single-frequency tonals with close frequencies.
We have also found that an AR order beyond 6 is not helpful,
as the extra poles try to match the spurious peaks due to ocean
noise. It is conceivable that some performance improvement is
still possible with AR-model-based features [5]; however, as
we will show, the FFT power spectral features and Daubechies
wavelet based features hold more promise for our classification
task. Please note that the results reported in [2] and [6] are also
not favorable for AR-model-based features.

TABLE III
Confusion Matrix, HMM Classifier, and FFT Features
Rows: true class; columns: chosen class (A, B, C, D, E, F, N).

• 30 FFT features
• 8-state HMM
• Recognition accuracy = 89.6%
• Recognition accuracy = 95.5% when class for ocean noise is excluded.

Table III shows the number of errors in classifying the
total 154 test exemplars using FFT power spectral features
and HMM's. The best result is obtained with eight-state
HMM's. The recognition accuracy is now close to 90%.
When the ocean ambient noise class is excluded, the recognition
accuracy rises to 95.5%. Tables IV and V show the number
of errors in classifying the total 154 test exemplars using
wavelet-transform-based features and HMM's. The best result
is obtained with eight-state HMM's. The recognition
accuracy is now above 90%.

Table  VI  shows  the  number  of  errors  in  classifying  the  total 
154  test  exemplars  using  FFT  power  spectral  features  and 
MLP-NN.  The  best  result  is  obtained  with  20  nodes  at  the 
hidden  layer.  The  recognition  accuracy  is  now  above  90%. 
Tables  VII  and  VIII  show  the  number  of  errors  in  classifying 
the  total  154  test  exemplars  using  wavelet-transform-based 
features  and  MLP-NN.  Once  again,  the  recognition  accuracy 
is  above  90%.  In  particular,  the  MLP-NN  classifier  and 
Daubechies-20  transform  feature  combination  has  achieved  the 
best  individual  performance — 93.5%  classification  accuracy. 

TABLE IV
Confusion Matrix, HMM Classifier, and Daubechies-4 Features
Rows: true class; columns: chosen class (A, B, C, D, E, F, N).

• 30 features
• 8-state HMM
• Recognition accuracy = 91.5%

TABLE V
Confusion Matrix, HMM Classifier, and Daubechies-20 Features
Rows: true class; columns: chosen class (A, B, C, D, E, F, N).

• 30 features
• 8-state HMM
• Recognition accuracy = 90.9%

TABLE VI
Confusion Matrix, NN Classifier, and FFT Features
Rows: true class; columns: chosen class (A, B, C, D, E, F, N).

• 30 features
• 20 nodes for the hidden layer
• Recognition accuracy = 90.9%

TABLE VII
Confusion Matrix, NN Classifier, and Daubechies-4 Features
Rows: true class; columns: chosen class (A, B, C, D, E, F, N).

• 30 features
• 20 nodes for the hidden layer
• Recognition accuracy = 90.9%

TABLE VIII
Confusion Matrix, NN Classifier, and Daubechies-20 Features
Rows: true class; columns: chosen class (A, B, C, D, E, F, N).

• 30 features
• 20 nodes for the hidden layer
• Recognition accuracy = 93.5%

We have also experimented with left-to-right HMM's. Table
IX shows the number of errors in classifying the total 154 test
exemplars using wavelet-transform-based features. The best
result is achieved with Daubechies-4 transform features, and
is reported in Table IX. This best result, 88.9% classification
accuracy, is somewhat inferior to the results achieved by
fully connected HMM's. Another important point is that the
initialization process as described in Section III is only good
for fully connected HMM's. For left-to-right HMM's, the
initialization process needs to be defined in terms of an
initial guess of A and B probability parameters. This latter
initialization process is considerably more difficult, and needs
more intimate knowledge of the data.
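To make the "initial guess of A and B" concrete, the sketch below shows one common way to seed a left-to-right model. It is not the initialization used in the paper, which is not specified here: the banded transition matrix, the `stay` probability, and the uniform time segmentation used for the state Gaussians are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def left_to_right_transitions(n_states, stay=0.5):
    """Initial A: each state either stays or moves to the next state."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = stay
        A[i, i + 1] = 1.0 - stay
    A[-1, -1] = 1.0  # the final state can only loop on itself
    return A

def uniform_segment_gaussians(sequences, n_states):
    """Initial B: split each sequence into n_states equal time segments
    and fit one Gaussian (mean, diagonal variance) per segment position."""
    buckets = [[] for _ in range(n_states)]
    for seq in sequences:
        for s, chunk in enumerate(np.array_split(seq, n_states)):
            buckets[s].append(chunk)
    means = np.array([np.vstack(b).mean(axis=0) for b in buckets])
    # A small floor keeps the variances strictly positive.
    variances = np.array([np.vstack(b).var(axis=0) + 1e-6 for b in buckets])
    return means, variances
```

The uniform segmentation presumes that every exemplar progresses through its states at roughly the same rate, which is exactly the kind of knowledge about the data that the text says this initialization requires.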

B.  Combined  Classifier 

From the confusion matrices given by Tables III-VIII, it
is clear that every feature/classifier combination has a somewhat
different performance. A pertinent question is: can we
combine the evidence of all the feature/classifier combinations
to yield results that would be superior to any specific
feature/classifier combination? Such a combined classifier would
also be more robust. One simple way to combine the feature
vectors is to extract AR, FFT-based, and wavelet coefficients
from each frame, and then form a large vector which would
be the input to either an HMM or MLP-NN based classifier. In
our case, this solution would mean a 66-dimensional floating
point feature vector for each frame. This tremendous increase
in computational complexity can be avoided by intelligent use
of the classifier. A product-code HMM as described in [5] can
incorporate all three different feature vectors in one classifier
without substantial increase in the complexity. Unfortunately,
these three feature sets are not independent of each other as
required by the theory of product-code HMM.

TABLE IX
Confusion Matrix, L-R HMM Classifier, and Daubechies-4 Features
Rows: true class; columns: chosen class.

• 30 features
• 8-state HMM
• Recognition accuracy = 88.9%

TABLE X
Fused Confusion Matrix, Majority Vote Needed = 4
Rows: true class; columns: chosen class.

• Misclassified = 2
• No decision = 12
• X means no decision

TABLE XI
Fused Confusion Matrix, Majority Vote Needed = 3
Rows: true class; columns: chosen class.

We have devised a simple classifier, henceforth called the
majority classifier, that takes the output of each specific
feature/classifier combination and assigns the test exemplar the
class with the majority of votes only when the vote count reaches
a threshold. Since we have six votes per test exemplar, we
use thresholds of 3 and 4. If the majority vote is below
the threshold, that test exemplar is not classified. The detailed
experimental results are given in Tables X and XI. When
the threshold is 4, only two test exemplars are misclassified,
but 12 are not classified. When the threshold is 3, five test
exemplars are misclassified, but only three are not classified.
It is very clear that the exemplars that would otherwise be
classified erroneously are now labeled “nonclassified” by
the combined classifier. Also, very few test exemplars are
misclassified by the combined classifier. Thus, in the case
of a definitive decision, the combined classifier has close to
100% recognition accuracy. For test signals with low SNR,
the combined classifier tends to avoid a wrong decision:
lacking consistent evidence, it labels the signal
“nonclassified.”
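The voting rule of the majority classifier can be sketched in a few lines. The six votes correspond to the six feature/classifier combinations of Tables III-VIII. How ties are handled is not stated in the text; this sketch assumes a tie for first place yields no decision, and the function name is hypothetical.

```python
from collections import Counter

def majority_vote(votes, threshold):
    """Return the winning class label, or None for "nonclassified".

    votes: list of class labels, one per feature/classifier combination.
    threshold: minimum number of votes the winning class needs.
    A tie for first place is treated as no decision (an assumption).
    """
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count
    if top_count < threshold or tied:
        return None
    return top_label
```

For example, the votes A, A, A, B, C, N give a decision of A at threshold 3 but no decision at threshold 4, which illustrates why the stricter threshold trades misclassifications for rejections.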

It is possible to implement the classified-versus-nonclassified
decision for an individual classifier by finding
an optimum threshold. For an HMM classifier, the computation
of this optimal threshold requires some knowledge about the
distribution of the observation sequence corresponding to
the optimal Viterbi state sequence given the HMM parameters.
For an MLP-NN, the distribution at the output node is
needed. While these distributions are very difficult to find,
the majority classifier simplifies the threshold computation
problem enormously, especially when the output of each
classifier/feature combination is given equal weight. Also,
the best performance for a classifier/feature combination is
given by the NN/Daubechies-20 combination. In this case,
10 out of 154 exemplars are wrongly classified. In the case
of a majority classifier with threshold 3, five exemplars are
wrongly classified and three exemplars are not classified.
Even if we consider the “nonclassified” exemplars as wrongly
classified, the performance of the majority classifier is still
superior to any individual classifier.

Notes to Table XI:
• Misclassified = 5
• No decision = 3
• X means no decision

It is interesting to compare our data and results to those
of Ghosh et al. [2] and Desai et al. [3]. Although the same
DARPA data set is used in [2], [3], only four signal types
are considered in [3]. In [2], noise as a separate class is
not considered. As we have mentioned before, the addition
of ocean noise as a separate class makes the classification
of DARPA data set I very difficult. Another important point
is that, in our technique, no preprocessing is used, whereas such
processing is used in [2], [3]. Also, the experiments in [2]
use signals from the test portion of DARPA data set I. Since we have
no access to the “truthing” of these data, we have focused
our experiments on the training portion of DARPA data set I. The
“truthing” of all the signals in this set is known. Thus, a proper
comparison is not possible at this point, though our technique
has performed quite well, which supports our approach. Also,
a number of conclusions from our work confirm two main
observations made in [2]. These are: 1) both FFT- and
wavelet-based features are promising; and 2) a combined classifier is
likely to yield better results.

VI. Concluding Remarks and Future Research

Based  on  the  experimental  results,  the  following  conclusions 
are  in  order: 

1) Both FFT- and wavelet-based features are promising.

2) To a certain extent, the wavelet-based features complement
the FFT-based features.

3) To a certain extent, the HMM classifier complements the
NN classifier.

4) The combined classifier has the best result. Also, the
combination is very robust. Only a simple combination is
described in the paper. Other possible combinations of HMM/NN
classifiers should be explored.

5) For longer signals, HMM-based classification could prove
more effective. The longer signals are likely to have more
sequence information. The exploitation of this sequence
information is the rationale for using HMM.

6) Features based on other wavelets, such as Gabor
wavelets, may prove more effective. This is another interesting
future research topic.